21:24
2026-06-29
lesswrong.com
ai-safety
Role confusion: sounding like the cause is indistinguishable from being it.
A replication of the 2026 paper 'Prompt Injection as Role Confusion' on a single consumer GPU confirms that style-based prompt injection attacks work, but causal tests using activation steering and paโฆ